Web-scale profiling of semantic annotations in HTML pages

Author

  • Robert Meusel
Abstract

The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision, and a large number of web sites have adopted them in recent years. Semantic annotations are integrated into the code of HTML pages using one of three markup languages: Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases.

However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What are the topics covered by semantic annotations? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful for parties other than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps.

In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis first evaluates the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations on the Web.

In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations, both within the same web site and across different web sites.

The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal, the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze the capability of product-related semantic annotations to be integrated into an existing product categorization schema. In particular, the thesis evaluates the potential of exploiting the diverse category information given by the web sites that provide semantic annotations.
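As a rough illustration of what detecting the three markup languages named in the abstract might involve, the following Python sketch scans an HTML page for a few characteristic attributes and class names. It is not the extraction framework described in the thesis. The attribute and class lists, the `AnnotationDetector` class, and the `detect_markup` function are assumptions chosen for this example, and a real extractor would follow each syntax's full specification.

```python
from html.parser import HTMLParser

# Assumed, deliberately incomplete cues for each markup language.
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "vocab", "about"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "h-card", "h-entry"}


class AnnotationDetector(HTMLParser):
    """Records which markup languages seem to be present in a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; valueless attributes
        # such as `itemscope` arrive with a value of None.
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.found.add("Microdata")
        if names & RDFA_ATTRS:
            self.found.add("RDFa")
        for name, value in attrs:
            if name == "class" and value:
                if set(value.split()) & MICROFORMAT_CLASSES:
                    self.found.add("Microformats")


def detect_markup(html):
    """Return the set of markup languages hinted at by the given HTML."""
    detector = AnnotationDetector()
    detector.feed(html)
    detector.close()
    return detector.found


if __name__ == "__main__":
    page = (
        '<div itemscope itemtype="http://schema.org/Product">'
        '<span itemprop="name">Phone</span></div>'
    )
    print(detect_markup(page))
```

Counting such detections per web site over a crawl would give a very coarse adoption profile; the heuristic can misclassify pages (for example, any element with a `property` attribute is counted as RDFa), so it should be read only as a sketch.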


Similar resources

Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web of Data

The appearance of Linked Open Data (LOD) was an important milestone for reaching a Web of Data. More and more RDF data sets get published to be consumed and integrated into a variety of applications. Pointing out one application, Linked Data can be used to enrich web pages with semantic annotations. This gives readers the chance to recall Semantic Web’s knowledge about text passages. RDFa provi...


Annotating Virtual Web Documents with DynamicsMarks

Annotating Web pages facilitates document management and collaboration on the WWW, and is also a key factor of user-driven metadata creation on the Semantic Web. However, though there exists a broad number of annotation systems for static HTML, none of them supports the pervasive annotation of dynamically generated Web content. The DynamicMarks annotation system introduced in this paper relies ...


Study of Design Issues on an Automated Semantic Annotation System

The semantic annotation process turns ordinary HTML web pages into machine-understandable semantic web pages. We have proposed a semantic annotation system to automate annotation of web pages by ontologies. In this paper, we present a study of five design issues on this system, which include: (1) the covering of web pages; (2) the general paradigm of annotation process; (3) the measurements of ...


Intra/Inter-document Change Awareness for Co-authoring of Web Sites

Systems that support the co-authoring of web sites often allow users to freely edit pages. This can result in semantic inconsistencies within and between pages. We propose a change awareness mechanism that monitors intra- and inter-document edits, taking into account changes made to a page and pages connected to it through HTML or transclusion links. The effect of all the changes is computed base...


Conceptual Graphs and Annotated Semantic Web Pages

The Semantic Web aims at turning the Internet into a machine-understandable resource, which requires the existence of ontologies, methods for ontology mapping, and pages annotated with semantic markup. It is clear that the manual annotation of pages is not feasible and the fully automatic one is impossible, so the current trend is the creation of tools for semi-automatic page annotation. Platforms like KIM a...




Publication date: 2017